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Abstract 

Generating local addresses and communication sets is an impor- 
tant issue in distributed-memory implementations of data-parallel 
languages such as High Performance Fortran. We show that for 
an array A affinely aligned to a template that is distributed across 
p processors with a cyclic(k) distribution, and a computation in- 
volving the regular section A(l : h : s), the local memory access 
sequence for any processor is characterized by a finite state ma- 
chine of at most k states. We present fast algorithms for computing 
the essential information about these state machines, and extend 
the framework to handle multidimensional arrays. We also show 
how to generate communication sets using the state machine ap- 
proach. Performance results show that this solution requires very 
little runtime overhead and acceptable preprocessing time. 

1 Introduction 

Languages such as Fortran D [4] and High Performance Fortran [5] 
allow the programmer to annotate programs with alignment and dis- 
tribution statements that describe how to map arrays to processors 
in a distributed-memory environment Both languages introduce 
an auxiliary Cartesian grid called a template ; arrays are aligned to 
templates, and templates are distributed across the processors. The 
alignment of an array to a template and the distribution of the tem- 
plate determine the final mapping of the array. The compiler must 
partition the arrays and produce “node code” for execution on each 
processor of a distributed-memory parallel computer. 

Consider an array A of size n aligned to a template such that 
array element A(t) is aligned to template cell ai -f* b. The template 
is distributed across p processors using a cyclic(k) distribution, in 
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n size of array A 

a stride of alignment of A to the template 

b offset of alig?iment of A to the template 

p number of processors to which the template is distributed 

k block size of distribution of the template 
l lower bound of regular section of A 
h upper bound of regular section of A 
a stride of regular section of A 

m processor number 


Table 1: Symbols used in this paper. 

which template cell t is held by processor (i div k) mod p. (We 
number array elements, template cells, and processors starting from 
zero.) Thus the array A is split into p local arrays residing in the local 
memories of the processors. Consider a computation involving the 
regular section A(t:h : s ), i.e., the set of array elements {A(t + 
js) : 0 < i < (A - l)j s}, where s > 0. (Table 1 summarizes the 
notation.) hi this paper we consider the following problems: 

• What is the sequence of local memory addresses that a given 
processor m must access while performing its share of the 
computation? 

• What is the set of messages that a given processor m must 
send while performing its share of the computation? 

The cyclic(k) distributions generalize the familiar block and 
cyclic distributions (which are cyclic( \n/p ] ) and cyclical) respec- 
tively). Dongarra et ai [3] have shown these distributions to be es- 
sential for efficient dense matrix algorithm^ on distributed-memory 
machines. High Performance Fortran includes directives for multi- 
dimensional eye lie (k) distribution of templates. 

Many special cases of these problems have been solved, but the 
general solution does not appear in the literature or (to the best of 
our knowledge) in any implementation. Koelbel and Mehrotra [8] 
treat special cases where the array is mapped using a block or a 
cyclic mapping, and the regular section has unit stride. Koelbel [7] 
handles non-unit-stride regular sections with block or cyclic map- 
pings. MacDonald et ai [9] discuss the problem in the context of 
Cray Research’s MPP Fortran, and conclude that a general solu- 
tion requires unacceptable address computation overhead in inner 
loops. They therefore restrict block sizes and number of processors 
to powers of two, allowing the use of bit manipulation in address 
computation. Experimental implementation of our solution shows 
that address computation and interprocessor communication can be 
performed efficiently without these restrictions. 


The paper is organized as follows. Section 2 solves the problem 
for a one-level mapping (a = 1,6 = 0). Section 3 reduces the 
problem for two-level mappings to combining the solutions from 
two subproblems involving one-level mappings. Section 4 extends 
the framework to handle multidimensional arrays, and Section 5 
extends it to handle generation of communication sets. Section 6 
discusses performance results, and Section 7 presents conclusions. 


Following equation (4), we can write AM in terms of the dif- 
ferences in course and offset: 

AM(%u * 2 ) = k • AC(<i,ta) + A<9(t’i, 1 * 2)1 (5) 

where the definitions of A C and A O are analogous to that of AM . 

Here are some examples. We assume that 1 = 0 and that h is 
large. 


2 One-level mappings 

We first consider the problem where a one-dimensional array A 
is aligned to a template with a — 1 and 6 = 0. This is called 
a one-level mapping . hi this case, the array and the template are 
in one-to-one correspondence, and we specify the problem by the 
pair (p, k) describing the distribution of the template across the 
processors. As shown in Figure 1(a), we visualize the layout of 
the array in the processor memories as courses of blocks. Three 
functions V, C t and O describe the location of array element A(i). 
7>(»‘;p,ik) is the processor holding . 4 ( 1 ), C(i;p, k) is the course 
of blocks within this processor holding A(i), and 0(i\ p, k) is the 
offset of A(») within the course. The layout of array elements in 
the local processor memories is as shown in Figure 1(b). 

These functions are defined as follows. 

7>(t;p, k) = (1 mod pk) div k = (i div k) mod p (1) 

C(i\ p, k) = t div pk — (t div k) div p (2) 

0(i\ p, k) = i mod k (3) 

Therefore, we have 

% = pk * C(i;p, k) + k * V(i\p , k) -f 0(i\p , k ). 

The following function gives the local memory location (with 
respect to the base of the local array) where element A(i) is stored 
within processor V{i\p, k): 

M (*; p,k) = k . C{ «; p, k) + C(i- p, k). (4) 

We are interested in the sequence of local memory locations that 
an individual processor, numbered m, accesses while performing 
its share of the computation on the regular section A(£ : h : 3 ). We 
specify the sequence of accesses by its starting location and by 
the differences AA^(t), 12 ) = M(\i\p, k) — k ), where *t 

and *2 are two successive indices of the part of the regular section 
mapped to processor m. As the examples in Figure 2 illustrate, the 
spacing AM is not constant. 

The key insight is that (for a particular p, k , and stride s) the 
offset 0(ii; p, k) of one element of the regular section determines 
the offset <9(iy, p, k) of the next element of the regular section in 
the same processor, and also determines the spacing AM(i\, 12 )* 
Thus we can represent the offset sequence and the AM sequence 
as a simple table indexed by offset. We visualize this table as the 
transition diagram of a finite state machine (FSM) whose k states 
are the offsets. This FSM is particularly simple because it has no 
input, and each state has only one outgoing transition; in fact, it is 
just a union of state-disjoint cycles. Thus the sequence of AM f s is 
periodic, with period almost k. 

The contents of the table, or the transitions of the FSM, depend 
on p, ik, and a, but not on the section bounds l and h or the processor 
number m. The same FSM describes the offsets for every processor, 
though each processor sees at most one cycle of the FSM. The first 
memory location used by the section in processor m (and its offset, 
which is the “start state” for that processor’s view of the FSM) 
depends on l and m as well as p, k , and s. 


Example 1 k = 4,p = 4, s = 3. The distribution is shown in 
Figure 2(a). The transition table T is as follows. For convenience, 
we also tabulate AC and AO. This shows a simple case where all 



Example 2 ]fc = 4,p = 3,s = 3. The distribution is shown in 
Figure 2(b). The transition table T is as follows. This demonstrates 
that the FSM for a problem can consist of disjoint cycles. Each 
processor sees at most one cycle of the FSM, depending on its start 



Example 3 k = 4,p = 3,s = 5. The distribution is shown in 
Figure 2(c). The transition table T is as follows. The AM sequence 



2.1 The algorithm 

Preparatory to finding a processor’s AM sequence, we solve the 
problem of finding its starting memory location. Given a processor 
number m, regular section A(£ : h : s), and layout parameters p and 
k , we wish to find the first element (if any) of the regular section that 
resides on that processor. This is equivalent to finding the smallest 
nonnegative integer j such that V[t -f sj\ p, k) = m. 

Substituting the definition of V from equation (1), we get 

((/ -f sj) mod pk) div k = m. 


This reduces to 

km < (l + sj) mod pk < k(m -f 1) - 1 
which is equivalent to finding an integer q such that 

km — l < sj - pkq < km - i + k — 1. (6) 
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Figure 1: A one-level mapping, (a) Layout parameters. Processors run across the page, while memory is organized as a two-dimensional 
array of courses and offsets. Each box represents a memory cell, and the number in the cell indicates the array element it holds, (b) Layout 
of array elements in local processor memories. Each processor’s memory is one row of the diagram. 
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Figure 2: One-level array mappings. The corresponding state tables are shown in Examples 1-3. (a) k = 4,p = 4, s = 3. (b) 
k = 4,p = 3,s = 3. (c) k = 4,p = 3, s = 5. 


We rewrite inequality (6) as a set of k linear Diophantine equa- 
tions in the variables j and g, 

{sj — pkq = A : km — i < A < km — l + k — 1}. (7) 

The equations in this set are to be solved independently (rather than 
simultaneously). Solutions exist for an individual equation if and 
only if A is divisible by GCD (a, pk). 

The general solution to the linear Diophantine equation sj — 
pkq = A is found as follows. Let d = GCD(s t pk) = as -b /3pk y 
with a, p £ Z. The quantities a, /?, and d are determined by the 
extended Euclid algorithm [6, page 325]. Then the general solution 
of the equation is the one-parameter family j = (Acr -b pkj)/ d and 
q — for 7 6 Z. 

For each solvable equation in set (7), we find the solution having 
the smallest nonnegative,;. The minimum of these solutions gives 
us the starting index value l -f js, which we then convert to the 
starting memory location using equation (4). We compute the 
ending memory location similarly, by finding the largest solutions 
no greater than h, and taking the maximum among these solutions. 

Now we extend this idea to build the AM sequence for the 
processor. A sorted version of the set of smallest positive solutions 
determined above gives us the sequence of indices that will be 


successively accessed by the processor. A linear scan through this 
sequence is sufficient to generate the AM sequence for the cycle of 
the FSM visible to processor m. If the FSM has multiple cycles, 
the local AM sequences may be different for different processors; 
otherwise, the sequences are cyclic shifts of one another. To avoid 
following the NEXT pointers at runtime, the AM table we compute 
contains the memory spacings in the order seen by the processor. 

All of these elements are combined in Algorithm 1, which pro- 
vides the solution for the memory address problem for a one-level 
mapping. The running time of Algorithm 1 is 0(logmin(s,pA;)) + 
O(Hogib), which reduces to O(vom(k\ogk -b logs.fclogfc -b 
logp)). As die output can be of size k in the worst case, any 
algorithm for this problem takes Cl(k) time. Finding an algorithm 
with O(ib) running time remains an open problem. Our implemen- 
tation recognizes the following special cases that can be handled 
more quickly: p = 1, s divides k f k = 1, pk divides s, a single 
course of blocks, and s divides pk. 

Algorithm 1 can be simplified by noting that the right hand sides 
of successive solvable equations differ by d, and that the identity 
of the first solvable equation is easily determined from the value of 
(km — t) mod d. This removes the conditional from the loop in 
lines 5-12. Similar incremental computations can also be used to 


3 








simplify the floor and ceiling computations in that loop. 

The node code for each processor consists of the following loop. 

base <— startmem; i <— 0 
while (base < lastmem) do 
compute with address base 
base base + AM [i] 
i (i + i) mod length 
enddo 


Algorithm 1 (Memory access sequence for a one-level 
mapping.) 

Input: Layout parameters (p, k ), regular section A{t :h :s ), 
processor number m . 

Output: The A M table, the length of the table, starting memory 
location, ending memory location, and ^ AM for 
processor m. 

Method: 

1 integer A M [k], length 

2 integer A, i, start, end, d , a, jd, lcourse , loffset, course, 

offset 

3 integer loc[k-hl ], sorted[k-hl ] 

4 (d, a,/?) <- Extended-EucLJD(s, pk); length <— 0; 

start h + l; end <— i — 1 

5 for A = km — i to km — t + k — 1 do 

6 if A mod d = 0 then /* Solvable equation */ 

7 locpength) <- £ -f ±(Aa + 

8 start <- minfstart, loc[length]) 

9 end ^-ma x(end,l + ^(Aq + L ^~ p£7* a * J)) 

10 length <— length -f 1 

1 1 endif 

12 enddo 

13 if length - 0 or start > h then 

14 return AM , ±, ±, -L, -L /* No work for processor */ 

15 endif 

1 6 sorted <— SORT (ioc, length ) 

17 sorted/lengthj sorted[0] -j-pks/d ; lcourse 

«— C(sorted[0]; p, k); loffset *— O(sorted[0]; p, k) 

18 for i - 0 to length - 1 do 

19 course <— C(sorted[i + 1 ]; p, k); offset 

«— 0(sorted[i + 1 ]; p, k) 

20 AM [i] <— k • (course - lcourse) + (offset - loffset) 

21 lcourse course; loffset <— offset 

22 enddo 

23 return AM , length, M (start; p, k), M (end; p, k), 

sk/GCD (s,pk) 


We illustrate the actions of Algorithm 1 on Example 3 for pro- 
cessor 0. Here, p = 3, k = 4, and s = 5. The extended Euclid 
algorithm returns d = 1, a = 5, and (3 = —2. Since each linear 
Diophantine equation is solvable, the loop in lines 5-12 is executed 
four times. At the end of this loop, ioc - [0, 25, 50, 15], length - 4, 
start — 0, and end — 50. After line 17, sorted - [0, 15, 25, 50, 60], 
At the end of the loop in lines 1 8-22, , A M - [7, 2, 9, 2], 

3 Two-level mappings 

Two-level mappings are solved by two applications of Algorithm 1. 
As an example, let A be an array with element A(i) aligned with 
template cell 3*, let the template be mapped with p = k = 4, and let 
the regular section be A(0 : 42 : 3), which we abbreviate to B. The 


picture is shown in Figure 3(a). The boxes represent template cells, 
the number j indicates the location of A(j), and numbers in squares 
are part of the regular section A(0 : 42 : 3). Two-level mappings 
encompass all mappings possible with the linguistic constructs of 
nested sections, affine alignment, and cycltc(k) distribution. 

Unlike the one-level case, all template cells do not contain array 
elements. However, we allocate local storage only for the occupied 
cells of the template. Thus, the local memory storage for A in 
the example above is as shown in Figure 3(b). We handle this 
complication by first doing all our computations with respect to the 
template (the local template space), and then switching back to the 
loci memory space in the final step. The picture in local template 
space is shown in Figure 3(c). 

Note that A and B are both aligned to “regular sections” of the 
template; A(») is aligned to template cell 3i, and B( i) is aligned 
to template cell 9*. We therefore create state tables for A and B 
with respect to the local template space using Algorithm 1 . The 
“AAf” entries returned by the algorithm are actually spacings in 
local template space; we therefore rename this column AT. This 
gives the tables for A and B shown below. 




Now all we need is to switch back to local memory space and 
fill in the AM entries. 

As we traverse the state table Ta for any processor m, we 
access each element of A stored on that processor, in order. But 
since storage is allocated based on A, this translates to accessing 
consecutive locations in local memory. Thus the AM entries for 
Ta are all unity. Now it is a simple matter to find the AM entries 
for the desired section B. For instance, the first line of Tb says 
that if we are at an element of the section that has offset 0, the 
next element has offset 2 and is 6 template cells away. To find 
the corresponding AM, we start in state 0 of Ta, and walk the 
state table, accumulating AT and AM counts. We stop when we 
have accumulated the desired AT count. (It is easy to show that at 
this point we will have arrived at the correct offset as well.) The 
accumulated AM value is the desired entry for that line of Tb . This 
procedure results in the following table for B. 1 


STATE 

NEXT 

AT 

AM 

0 

2 

6 

2 

1 

3 

6 

2 

2 

1 

15 

5 

3 

0 

9 

3 


The algorithm as presented above requires O(k) time to com- 
pute each AM entry of Tb- Two optimizations reduce this com- 
plexity to 0(1), allowing the final fixup phase to run in O(fc) time. 
First, we compute the sum of the AT and the AM around the state 

1 In this case, AAt turns out to be AT divided by the alignment stride a - 3, but 
this is not true in general. Tty the following case: a - 3, b -0,p - 3, tc - 8,s -3. 
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Figure 3: A two-level mapping of an array A aligned to the template with stride 3. (a) Layout of template cells. Numbers are array elements; 
boxed numbers are in the regular section A(0 : 42 : 3). (b) Local memory storage for array, (c) Local template space for array. 



Figure 4: Mapping a two-dimensional array. Layout of array elements on processor arrangement with p = k = 3 in the vertical dimension 
andp — k — 2 in the horizontal dimension. The boxed pairs are A( 0: 17: 2,0: 7: 3). 
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machine cycle, so that we can cast out any full cycles that must be 
traversed. Second, we tabulate the distance of each offset from a 
fixed origin (die numarcs table in Algorithm 2), allowing the com- 
putation of the distance between any two offset positions by two 
table lookups and some simple arithmetic. Consider die third line 
of Tb> which has a AT of 15. Using the precomputed sums, we 
determine in constant time that this corresponds to one full cycle 
phis 3 more template cells, and that the full cycle contributes a 
memory offset of 4. Now all we need is to determine the num- 
ber of memory cells encountered in going from offset position 2 
to offset position 1 in Ta . We find this distance in constant time 
using the numarcs table. Algorithm 2 shows the method with these 
optimizations incorporated. 


Algorithm 2 (Memory access sequence for a two-level 
mapping.) 

Input: Alignment parameters (a, b), layout parameters (p, k), 
array length n, regular section A(l: h : s), processor 
number m. 

Output : The AM table, length of the table, starting memory 
location, ending memory location, and the length of 
A in processor m. 

Method: 

1 integer i, mm, numaics[k] 

2 integer AM[k], length , startmem, iastmem 

3 integer AMI Pc], length 1, startmem 1, lastmeml, memcyclel 

4 integer AM2[k], length2, startmem2, lastmem2, memcycle2 

5 function residual(x) - 

length 1 •[ (x- startmeml)/mcmcyclel J 

6 function major(x) - (numarcs[x mod k] - 

numarcsfstartrneml mod kj) mod length 1 

7 function realmem(x) - major(x) + Tesidual(x) 

8 (AMI, length 1, startmem U lastmeml, memcyclel) 

Algorithm l(b, a(n — 1) -f b, a, p, k, m) 

9 (AM 2, length2 t startmem2, lastmem2, memcycle2) 

Algorithm l(a£ + b, ah + 6, as, p, k, m) 

10 if length 1 -» X or length 2 — _L then 

11 return AM, X, X, X, 0/* No work for processor */ 

12 endif 

13 mm +— startmem 1 

14 for i — 0 to length 1 — 1 do 

15 numarcs[mm mod k] «— i 

16 mm <— mm + AMl[i] 

17 enddo 

18 mm <— startmem2 

19 for i - 0 to length2 — 1 do 

20 A M[i] *- realmem(mm X AM2[i]) - realmem(mm) 

21 mm mm -f A M2[i] 

22 enddo 

23 return AM, length2, reahnem(startmem2), 

reahnem(lastmem2), realmem(lastmeml) — 
realmem(startmeml)+l 


4 Multidimensional arrays 

So far, we have characterized the access sequence of a single pro- 
cessor by a start address and a memory stride pattern, the latter 
being generated by the transitions of a FSM. For a multidimen- 
sional array, each dimension will be characterized by a (p, k) pair 
and have a FSM associated with it, and each processor will execute 


a set of nested loops. Instead of the start address being fixed as it 
was before, it too will be generated by the transitions of a FSM. For 
simplicity, we will deal with a two-dimensional array. The exten- 
sion to higher dimensions is straightforward. An array dimension 
with p = 1 is “mapped to memory”. 

Consider a one-level mapping of an array A( 0: 17, 0 : 7) with 
parameters (pi, fci) - (3, 3) in the first dimension and (p 2 , fc 2 ) — 
(2, 2) in the second dimension, and let the section of interest be 
A( 0: 17:2,0:7:3). The picture is shown in Figure 4. 

Let each block be laid out in column-major order in local mem- 
ory. Thus, the sequence of index values corresponding to successive 
memory elements of processor (0,0) is {(0,0), (1,0), (2,0), (9,0), 
(10,0), (11,0), (0,1), . . Such a decomposition is used by Don- 
garra et al for dense linear algebra codes on distributed-memory 
machines [3]. Then the tables corresponding to the two dimensions 


of A are 


and 



(Had the section been laiger in the second dimension, the col- 
umn accessedby processor (0, 0) immediately after column 0 would 
be column 9. The elements (0, 0) and (0, 9) would be 30 locations 
apart in local memory. This explains the AM entry in the first row 
ofr 2 .) 

The FSM T 2 is used to generate memory spacings for successive 
base addresses, while the FSM T\ is used to generate memory 
spacings between successive elements with the same base address. 
Algorithm 3 computes these FSMs. 


Algorithm 3 (Memory access sequence for a multidimen- 
sional array with a two-level mapping.) 

Input: Number of dimensions d\ alignment parameters 

(ttj , 6j ), layout parameters (p ; , kj), dimension length 
nj , regular section A(l 3 : h , : Sj), processor number 
ruj, for each dimension j (1 < j < d). 

Output: A Mj, length j, startmemj , lastmem^ (1 < j < d). 
Method: 

1 integer j, mem, m, k 

2 mem «— 1 

3 for j - 1 to d do 

4 (A Mj , length j , startmemj , lastmemj , m) 

<— Algorithm 2(aj , b Jt pj, kj , n )f / Jf hj , 3 Jf m } ) 

5 startmemj <— startmemj • mem 

6 lastmemj <— lastmemj • mem 

7 for k - 0 to length^ — 1 do 

8 AA^j[k] «— A Mj[k] • mem 

9 enddo 

10 mem «— mem • m 

1 1 enddo 

1 2 return AM } , length 3 , startmem } , lastmemj 

(l<J<d) 


For the two-dimensional case, each processor executes the fol- 
lowing doubly-nested loop. Higher-dimensional cases follow the 
same pattern, with one loop for each dimension. 
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base 2 <— startmemi 
h «— 0 

while bas &2 < lastmen 22 do 
basei startmemi 
ii -0 

while basei < iastmemi do 

compute with address base 2 + basei 
basei basei +AA4i[iiJ 
ii <— (ii 4- l)mod length \ 

enddo 

base2 «— base2 +AMi[h] 
h {h+ l)mod length 2 

enddo 

5 Generating communication sets 

The set of messages that a given processor m must send while per- 
forming its share of the computation is called its cornmunication 
set . We now extend the FSM approach to generate communication 
sets for regular communication patterns such as shifts and stride 
changes, hi this section, we assume a one-dimensional array with 
a single-level mapping. Extensions to multilevel and multidimen- 
sional cases are as in the address generation problem. 

We adopt KoelbeFs execution model for parallel loops [7], in 
which a processor goes through four steps: 

1. Send those data items it holds that are required by other 
processors. 

2. Perform local iterations, i.e ., iterations for which it holds all 
data. 

3. Receive data items from other processors. 

4. Perform nonlocal iterations. 

We combine the final two steps into a receive-execute loop. 

Our approach puts all the intelligence on the sending side, keep- 
ing the receive-execute loop simple. We wish to minimize the 
number of messages sent and maximize cache benefits. To this end, 
each processor makes a single pass over the data in its local memory 
using the FSM technique described above, figuring out the desti- 
nation of each data element and building messages. The messages 
contain (address, value) pairs which the receiving processor uses to 
perform its computations. All the data that must go from processor 
i to processor j are communicated in a single message. The data 
items corresponding to the local iterations are conceptually sent by 
the processorto itself. Of course, this message is never sent; rather, 
the local iterations are performed out of the message buffer. The 
idea is similar to the inspector-executor model [10] for irregular 
loops. 

5.1 Shifts 

Consider the computation 

A{0:h:s) = A(0:h:s) + A{l:l+h: s ), 

where the regular section A(t: l - j- k:s) istobe moved to align 
with A( 0: h : s). (For simplicity, assume t > 0.) hi this case, a 
processor co mmuni cates with at most two processors. We show 
how to augment the FSM with a AV and a A C column, such that 
adding these quantities to the processor and memory location of 
die current element gives the remote processor number and the 
memory location in the remote processor of the element with which 
it interacts. 


Suppose we know that element A(t) is located on processor m 
at memory location M{i\p y k) with offset O, = 0(i\p t k). Then 
the problem is to find die processor and local memory location of 
the element A(i — t). Let / = q *pfc + r, andAP = f(r — Oi)/k " |. 
Then V{\ - /;p, k) = (m - AT + p) mod p. Since 0 < O t < k, 
A V can assume only two values as the offset O, varies. 

We also define the quantity AC such that M(\ — /; p, fc) = 

M(i;p,k) + AC(Oi). 

A C(Oi) = ((O; - r) mod k) - O, - kq - tj, 

k if (mk - r + O,) < 0, 

0 otherwise. 

5.2 Stride changes 

Now consider the communication required to move the regular sec- 
tion A(0: S 2 ti : $ 2 ) to align with the regular section A(0: sin : s 1 ). 
This is harder than a shift; there is no simple lookup technique for 
generating the communication sets. The problem is that the pattern 
of destination processors can have period longer than k. The only 
fact that appears to be true in this case is that the sum of the periods 
over all processors divides pk. We therefore resort to a simple 
technique. 

We generate the FSM to reflect the storage properties of the S 2 - 
section. Now, the j th element of this section interacts with index 
i — j • $1 of A. Let q\ = t div pk , n = t mod pk , q 2 = n div k> 
and r 2 = n mod k . Then the desired element is stored in memory 
location q\ • k + r 2 on processor q 2 . 

Many modem RISC microprocessors perform integer division 
and remainder operations in software, which are rather slow. The 
equations above can be rewritten to require only two integer divi- 
sions, which substantially speeds up the implementation. If either 
p or k is a power of two, the computation can be sped up further by 
using shifts and bit masks instead of the division operations. Our 
implementation incorporates these optimizations. 

5.3 Termination 

Each processor must determine when it has completed all its opera- 
tions. As we do not send messages between every pair of processors, 
we cannot use the number of messages to determine termination. 
However, it is easy to figure out the total number of operations that 
processor m will execute (by a variant of Algorithm 1). A simple 
termination test initializes a counter to this value, decrements it by 
the number of array elements received in each message, and exits 
when the counter reaches zero. 

6 Experimental results 

In this section we present experimental results of implementing the 
algorithms above. We measure both the preprocessing cost of table 
generation and the runtime loop overhead. 

Our computational kernel is the DAXPY operation 

y(l:n:s) * y(l:n:s) + a*x(l:n:s). 

The ratio of memory operations to computation is high in this 
kernel, making it unfavorable for our algorithms. The overhead 
due to irregular memory access patterns would decrease with more 
computation in the loop. All experiments reported in this section 
were done on a Sparcstation 2, for which Dongarra [2] reports a 
100 x 100 double-precision LINPACK performance of 4.0 Mflops. 
(We measured 3.3 Mflops.) (A repetition of the experiments on 
a single processor of an iPSC/860 showed similar characteristics.) 
We used the GNU C compiler, with optimization at the -02 level. 
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Figure 5: Performance of DAXPY for various block sizes *. 


A higher op timiza tion level increased compilation time without 
improving the quality of the generated code. With 5 = 1 the 
loop compiled into 9 instructions, while the general case with table 
lookup compiled into 17 instructions. Our measured performance 
for the 5 = 1 case was 2.0 Mflops for long vectors. 

6.1 Local address generation 

We have seven parameters for a one-dimensional array: two align- 
ment parameters a and 6, two layout parameters p and fc, and three 
section parameters £, h , and s . This is too large a space over which 
to present results. We now explain how we reduce the number of 
variables. 

• Alignment stride a: For any constant c, the memory layout 
for parameters ca and ck is exactly the same as that for a 
and k. Thus we experiment only with a = 1; the effect of 
changmg a can be simulated by changing k instead. 

• Alignment offset 6: This is irrelevant for large sections. 
For s mall sections where end effects could be important we 
observed that changes in b affected results by less than 5%. 
We therefore take 6 = 0. 

• Section lower bound £: Again, this is irrelevant for large 
sections and has negligible effect even for small sections. 
We therefore take £ = 0. 


• Section upper bound h : This affects th e number of elements 
in the DAXPY, which in turn affects the tradeoff between loop 
startup and iteration time in a single processor. We vary h 
along with 5 , keeping h/s fixed, so that all our experiments 
do the same amount of work regardless of stride. 

• Section stride s; This is the z-axis of our plots, as one of the 
two independent variables. 

• Number of processors p: We run our experiments with a 
fixed number of processors. Varying the number of proces- 
sors has the effect of scaling the plots. 

• Block size k: This varies from plot to plot, as the second of 
the two independent variables. 

We use the execution time per element of the DAXPY loop as 
our measure of performance. For a given set of parameters, we 
compute the tables for each processor and time the execution of 
the distributed loop. Let r m be the time taken by processor m to 
execute the n m iterations it receives. Then the execution time per 
element is Tm 

t e = max 

m Tim 

This factors out the effect of load imbalance. : : 

We plot 2/U (the megaflops per second per processor) against 
stride s for various block sizes k. The plots are shown in Figure 5. 
The curves are the sum of two effects. 


8 



• -1 *-7 p-JJ 



Figure 6: Preprocessing overhead for memory stride computations. 
The graph corresponds to an alignment stride of 1 . The multiple 
points for each block size show the processing times for each of 32 
processors. 
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Table 2: Performance of communication set generation algorithms. 
All numbers are for 32 processors, (a) Shift communication: the 
shift distance is t. (b) Change of stride: the stride is changed from 
32 to 3 1. 


1. Baseline: The lower envelope is a smooth, relatively flat 
Junction of s. The value of the envelope is approximately the 
asymptotic performance for the single-processor DAXPY. 

2. Jiggles: The plot jiggles up and down some, presumably due 
to cache stride effects [1]. 

Although the general loop has almosttwice as many instructions 
as the unit-stride loop, the difference in performance is much less. 
This is because the execution time is dominated by the floating-point 
operations and references to operands in memory. The references 
to the AAT array usually hit in cache, and the additional loop control 
instructions probably execute in overlapped fashion. 

The preprocessing time required to compute the A AT tables is 
shown in Figure 6, where we plot the running time of Algorithm 1 
against block size k. The running time is approximately O(k), 
with certain special cases being handled much faster. Experimental 
observation confirmed that the running times for Algorithm 2 were 
no more than twice those of Algorithm 1 . 

If the A AT sequence is known at compile time, unrolling the 
loop bodies avoids the indirection into the AAT array. Figure 7 
shows that the benefits of this are mixed. While loop unrolling 
removes the indirection, it also increases the size of the loop body, 
thus the benefits of a simpler loop body may be undone by misses 
in the instruction cache caused by the larger loop body. This is a 
factor for larger block sizes, since the amount of unrolling maybe 
as large as k. We recommend unrolling only if the AAT sequences 
are short, or can be compacted, e.g., by run -length encoding. 

6.2 Communication set generation 

For the communication set generation algorithms, we measured the 
time required to generate the FSM tables and the time required to 
marshal the array elements into messages. We did not measure 
the actual co mmun ication time, since that varies widely among 
machines. The results are presented in Table 2. 

7 Conclusions 

Our table-driven approach to the generation of local memory ad- 
dresses and communication sets unifies and subsumes previous so- 
lutions to these problems. The tables required in our method are 


easy to compute, and the runtime overheads due to indirect address- 
ing of data are small. Our data access patterns exploit temporal 
locality and make effective use of cache. We also support blocking 
of messages to reduce message startup overheads. 

Our prototype implementation has demonstrated our perfor- 
mance claims. An optimized implementation of our techniques 
should deliver close to maximum performance for general cyclic(k) 
distributions of arrays. 

Our algorithms are fully parallel in that each processor of a 
parallel machine can generate its tables independently. Finding 
a fully parallel algorithm running in 0(k) parallel time or with 
0(p + k) total work remains an open problem. 
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